Abstract

A statistical classification algorithm and its ap- 
plication to language identification from noisy 
input are described. The main innovation is to 
compute confidence limits on the classification, 
so that the algorithm terminates when enough
evidence to make a clear decision has been
accumulated, thereby avoiding problems with categories
that have similar characteristics. A second ap- 
plication, to genre identification, is briefly ex- 
amined. The results show that some of the 
problems of other language identification tech- 
niques can be avoided, and illustrate a more im- 
portant point: that a statistical language pro- 
cess can be used to provide feedback about its 
own success rate. 

1 Introduction 

Language identification is an example of a gen- 
eral class of problems in which we want to as- 
sign an input data stream to one of several cat- 
egories as quickly and accurately as possible. It 
can be solved using many techniques, including 
knowledge-poor statistical approaches. Typi- 
cally, the distribution of n-grams of characters 
or other objects is used to form a model. A 
comparison of the input against the model de- 
termines the language which matches best. Ver- 
sions of this simple technique can be found in 
Dunning (1994) and Cavnar and Trenkle (1994), 
while an interesting practical implementation is 
described by Adams and Resnik (1997). 

A variant of the problem is considered by 
Sibun and Spitz (1994), and Sibun and Rey- 
nar (1996), who look at it from the point of 
view of Optical Character Recognition (OCR). 
Here, the language model for the OCR system 
cannot be selected until the language has been 
identified. They therefore work with so-called 
shape tokens, which give a very approximate
encoding of the characters' shapes on the printed
page without needing full-scale OCR. For exam- 
ple, all upper case letters are treated as being 
one character shape, all characters with a de- 
scender are another, and so on. Sequences of 
character shape codes separated by white space 
are assembled into word shape tokens. Sibun 
and Spitz then determine the language on the 
basis of linear discriminant analysis (LDA) over 
word shape tokens, while Sibun and Reynar ex- 
plore the use of entropy relative to training data 
for character shape unigrams, bigrams and tri- 
grams. Both techniques are capable of over 
90% accuracy for most languages. However, the 
LDA-based technique tends to perform signifi- 
cantly worse for languages which are similar to 
one another, such as the Norse languages. Rela- 
tive entropy performs better, but still has some 
noticeable error clusters, such as confusion be- 
tween Croatian, Serbian and Slovenian. 

What these techniques lack is a measure of 
when enough information has been accumulated 
to distinguish one language from another reli- 
ably: they examine all of the input data and 
then make the decision. Here we will look at a 
different approach which attempts to overcome 
this by maintaining a measure of the total ev- 
idence accumulated for each language and how 
much confidence there is in the measure. To 
outline the approach: 

1. The input is processed one (word shape) 
token at a time. For each language, we de- 
termine the probability that the token is 
in that language, expressed as a 95% con- 
fidence range. 

2. The values for each word are accumulated 
into an overall score with a confidence 
range for the input to date, and compared 
both to an absolute threshold, and with
each other. Thus, to select a language, we
require not only that it has a high score 
(probability, roughly), but also that it is 
significantly better scoring than any other. 

3. If the process fails to make a decision on the 
data that is available, the subset of the lan- 
guages which have exceeded the absolute 
threshold can be output, so that even if a 
final decision has not been made, the likely 
possibilities have been narrowed down. 

We look at this procedure in more detail below, 
with particular emphasis on how the underlying 
statistical model provides confidence intervals. 
An evaluation of the technique on data similar 
to that used by Sibun and Reynar follows^1.

2 The Identification Algorithm 

The essential idea behind the identification al- 
gorithm is to accumulate the probability of the 
language given the input tokens for each lan- 
guage, treating each token as an independent 
event. To obtain the probability of a language 
l given a token t, p(l|t), we use Bayes' rule:

$$ p(l|t) = \frac{p(t|l)\,p(l)}{p(t)} $$

where p(t|l) is the probability of the token if the
language is known, p(t) is the a priori probability
of the token, and p(l) is the a priori probability
of the language. We will assume that p(l) is
constant (all languages are equi-probable) and 
drop it from the computation; in the tests, we 
will use the same amount of training data for 
each language. The other two terms are esti- 
mated from training data, using the procedure 
described in section 2.2.
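As a small illustration of the rule above, the contribution of a single token can be written in the logarithmic form used in section 2.1. This is only a sketch: the function name `token_evidence` and the numbers are hypothetical, and the constant prior p(l) is dropped as described in the text.

```python
import math


def token_evidence(p_t_given_l: float, p_t: float) -> float:
    """Log of p(t|l) / p(t): the evidence one token contributes for language l.

    The prior p(l) is assumed constant across languages and is omitted.
    Both probabilities are assumed to be strictly positive.
    """
    return math.log(p_t_given_l) - math.log(p_t)


# Hypothetical values: a token twice as likely in language l as overall
# contributes positive evidence for l.
evidence = token_evidence(0.02, 0.01)
```

A token that is less likely in the language than overall gives a negative value, so it counts against that language.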

2.1 The language model and the 
algorithm 

The input to the algorithm consists of a stream 
of tokens, such as word shape tokens (as in Si- 
bun and Spitz, or Sibun and Reynar) or words 
themselves. The model for each language con- 
tains the probability of each known token given 
the language, expressed as three values: the basic
probability, and the lower and upper limits
of a range containing this probability for a specific
level of confidence. We will denote these by
p_B(t|l), p_L(t|l), p_H(t|l), for base, low and high
values. The probability that a token which has
never been seen before is in a language is also
present in the model of the language. In addition,
there is a language independent model,
containing the p(t) values. No confidence range
is used for them, although this would be a simple
extension of the technique.

1 Some ideas related to the use of confidence limits
can also be found in Dagan et al. (1991), applied in a
different area.

The algorithm proceeds by processing tokens,
building up evidence about each language in
three accumulators. The accumulators represent
the overall probability of the language
given the entire stream of tokens to date, again
as base, low and high values, denoted a_B(l),
a_L(l), a_H(l). They are set to zero at the start
of processing, and the logarithms of the probabilities
are added to them as each token is processed.
By taking logarithms of probabilities,
we are in effect measuring the amount of evidence
for each language, expressed as information
content. From a practical point of view,
using logarithms also helps keep all the values
in a reasonable range and so avoids numerical
underflow.

After processing each token, two tests are ap- 
plied. Firstly, we examine the base accumulator 
for the language which has the highest accumu- 
lated total, and test whether it is greater than a 
fixed threshold, called the activation threshold. 
If it is, then we conclude that enough information
has been accumulated to try to make a decision.
The low value for this language, a_L(l), is
then compared against the high value a_H(l') for
the next best language l', and if a_L(l) exceeds
a_H(l'), language l is output and the algorithm
halts. Otherwise, the process continues with
the next token, until the best choice language 
is a clear "winner" over any other. 

Finally, if we reach the end of the input data 
without a decision being made, several options 
are possible, depending on the needs of the ap- 
plication. We can simply output the language 
with the highest base score, even if the second 
test is not satisfied. Alternatively, we can out- 
put the highest scoring language, and all other 
languages whose high probability is greater than 
the low probability of this language. 
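The procedure of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `identify`, the dictionary layout of the models, and the fallback value `p_token_unseen` for tokens missing from the language independent model are all hypothetical. It also assumes at least two languages and strictly positive probabilities, so that every logarithm is defined.

```python
import math
from typing import Dict, Iterable, List, Optional, Tuple

Triple = Tuple[float, float, float]  # (base, low, high) probabilities


def identify(
    tokens: Iterable[str],
    models: Dict[str, Dict[str, Triple]],
    unseen: Dict[str, float],
    p_token: Dict[str, float],
    p_token_unseen: float,
    threshold: float,
) -> Tuple[Optional[str], List[str]]:
    """Return (decision, candidates); decision is None if there is no clear winner."""
    # One [base, low, high] log accumulator per language, all starting at zero.
    acc = {lang: [0.0, 0.0, 0.0] for lang in models}

    def ranked() -> List[str]:
        # Languages ordered by base accumulator, best first.
        return sorted(acc, key=lambda lang: acc[lang][0], reverse=True)

    for tok in tokens:
        log_pt = math.log(p_token.get(tok, p_token_unseen))
        for lang, model in models.items():
            # Unseen tokens use the single "zero probability" for all three values.
            probs = model.get(tok, (unseen[lang],) * 3)
            for i in range(3):
                acc[lang][i] += math.log(probs[i]) - log_pt
        order = ranked()
        best, runner_up = order[0], order[1]
        # Test 1: the best base accumulator must exceed the activation threshold.
        # Test 2: the best low value must exceed the runner-up's high value.
        if acc[best][0] > threshold and acc[best][1] > acc[runner_up][2]:
            return best, [best]

    # End of input without a decision: report the best language together with
    # every language whose high value is above the best language's low value.
    order = ranked()
    best = order[0]
    candidates = [best] + [l for l in order[1:] if acc[l][2] > acc[best][1]]
    return None, candidates
```

Returning the candidate list alongside the decision covers both end-of-input options described above: a caller can take the first candidate as the highest scoring language, or keep the whole list as the narrowed-down set.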



2.2 Training the model 

The model is trained using a collection of cor- 
pora for which the correct language is known. 
For a given language l and token t, let f(t,l) be
the count of the token in that language and f(l)
be the total count of all tokens in that language.
f(t) is the count of the token t across all the languages,
and F the count of all tokens across all
languages. The probability of the token occurring
in the language, p(t|l), is then calculated by
assuming that the probabilities follow a binomial
distribution. The idea here is that token
occurrences are binary "events" which are either
the given token t or are not. For large f(t,l), the
underlying probability can be calculated by using
the normal approximation to the binomial,
giving the base probability



$$ p_B(t|l) = \frac{f(t,l)}{f(l)} $$



The standard deviation of this quantity is 



$$ \sigma(t,l) = \sqrt{f(l)\,p_B(t|l)\,(1 - p_B(t|l))} $$

The low and high probabilities are found by 
taking a given number of standard deviations 
d from the base probability. 



$$ p_L(t|l) = \frac{f(t,l) - d\,\sigma(t,l)}{f(l)} $$

$$ p_H(t|l) = \frac{f(t,l) + d\,\sigma(t,l)}{f(l)} $$



In the evaluation below, d was set to 2, giving 
95% confidence limits. 
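For large counts, the three values can be computed directly from the formulas above. The sketch below is illustrative only; the function name `normal_limits` and the example counts are hypothetical.

```python
import math
from typing import Tuple


def normal_limits(count: float, total: float, d: float = 2.0) -> Tuple[float, float, float]:
    """Return (base, low, high) for f(t,l) = count and f(l) = total.

    Uses the normal approximation to the binomial, with the limits taken
    d standard deviations either side of the observed count.
    """
    base = count / total
    sigma = math.sqrt(total * base * (1.0 - base))  # standard deviation of the count
    low = (count - d * sigma) / total
    high = (count + d * sigma) / total
    return base, low, high


# Hypothetical example: a token seen 400 times among 2000 training tokens.
base, low, high = normal_limits(400, 2000)
```

Note that for small counts the low value from this formula can fall to zero or below, which is one reason the other cases described in this section are needed.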

For lower values of f(t,l), the calculation of 
the low and high probabilities can be made more 
exact, by substituting them for the base prob- 
ability in the calculation of the standard devia- 
tion, giving 



$$ p_L(t|l) = \frac{f(t,l) - d\sqrt{f(l)\,p_L(t|l)\,(1 - p_L(t|l))}}{f(l)} $$

$$ p_H(t|l) = \frac{f(t,l) + d\sqrt{f(l)\,p_H(t|l)\,(1 - p_H(t|l))}}{f(l)} $$

Approximating 1 - p_L(t|l) and 1 - p_H(t|l) to 1
on the grounds that the probabilities are small, 
and solving the equations gives 



$$ p_L(t|l) = \frac{\left(\sqrt{d^2 + 4f(t,l)} - d\right)^2}{4f(l)} $$

$$ p_H(t|l) = \frac{\left(\sqrt{d^2 + 4f(t,l)} + d\right)^2}{4f(l)} $$



The calculation requires marginally more com- 
putational effort than the first case, and in prac- 
tice we use it for all but very large values of 
f(t,l), where the approximation of 1 - p_L(t|l)
and 1 - p_H(t|l) to 1 would break down.
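The closed-form solution for moderate counts is equally short in code. With the factor 1 - p approximated to 1, writing x for the square root of f(l)p turns each implicit equation into a quadratic in x, and the positive root gives the expressions above. The function name `moderate_limits` and the example counts are hypothetical.

```python
import math
from typing import Tuple


def moderate_limits(count: float, total: float, d: float = 2.0) -> Tuple[float, float, float]:
    """Return (base, low, high) using the closed-form limits for moderate counts.

    Low and high solve f(l)*p = f(t,l) -/+ d*sqrt(f(l)*p), i.e. the implicit
    equations with the factor (1 - p) approximated to 1.
    """
    root = math.sqrt(d * d + 4.0 * count)
    low = (root - d) ** 2 / (4.0 * total)
    high = (root + d) ** 2 / (4.0 * total)
    return count / total, low, high


# Hypothetical example: a token seen 48 times among 2000 training tokens.
base, low, high = moderate_limits(48, 2000)
```

Unlike the large-count formula, the low value here is never negative, and the interval is not symmetric about the base probability.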

For very small values of f(t,l), say less 
than 10, the normal approximation is not good 
enough, and we calculate the probabilities by 
reference to the binomial equation for the probability
of m (= f(t,l)) successes in n (= f(l))
trials:

$$ p(m) = \frac{n!}{m!\,(n-m)!}\; p^m (1-p)^{n-m} $$

p is the underlying probability of the distribution,
and this is what we are after. By choosing
values for p(m) and solving to find p we
can obtain a given confidence range. To obtain
a 95% interval, p(m) is set to 0.025, 0.5
and 0.975, yielding p_L(t|l), p_B(t|l), and p_H(t|l),
respectively. In fact, this is not exactly how
the probability ranges for low frequency items
should be calculated: instead the cumulative
probability density function should be calculated
and the range estimated from it^2. For the
present purposes, the low frequency items do 
not make much of a contribution to the overall 
success rate, and so the approximation is unim- 
portant. However, if similar techniques were ap- 
plied to problems with sparser data, then the 
procedure here would have to be revised. 
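For reference, the cumulative variant mentioned above can be sketched with a simple bisection search. This is not the procedure used in the evaluation, only one possible reading of the cumulative calculation: the helper names `binom_cdf` and `solve_p`, and the pairing of the 0.975 and 0.025 targets with the low and high limits, are assumptions made for illustration.

```python
import math


def binom_cdf(m: int, n: int, p: float) -> float:
    """Probability of at most m successes in n trials with success probability p."""
    return sum(math.comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(m + 1))


def solve_p(m: int, n: int, target: float, iters: int = 100) -> float:
    """Find p such that binom_cdf(m, n, p) equals target, by bisection.

    For fixed m and n the cumulative probability decreases as p grows,
    so the root is unique for any target strictly between 0 and 1.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if binom_cdf(m, n, mid) > target:
            lo = mid  # cumulative probability too high, so p is too small
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical example: a token seen 5 times among 2000 training tokens.
low = solve_p(5, 2000, 0.975)
high = solve_p(5, 2000, 0.025)
```

Working from the cumulative probability is convenient because it is monotonic in p, whereas the probability of exactly m successes rises and then falls as p increases, so a chosen target value need not correspond to a single p.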

Finally, we need a probability for tokens 
which were not seen in the training data, called 
the zero probability, for which we set m = 0 in
the above equation giving 



$$ p(0|l) = 1 - \sqrt[f(l)]{p(m)} $$



It is not clear what it means to have a confidence 
measure here, and so we use a single value for 
base, low and high probabilities, obtained by 
setting p(m) to 0.95. 
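The zero probability follows directly from the formula above, since with m = 0 the binomial equation reduces to p(m) = (1 - p)^n. A minimal sketch, with a hypothetical function name:

```python
import math


def zero_probability(total: int, p_m: float = 0.95) -> float:
    """Probability assigned to an unseen token, given f(l) = total training tokens.

    Solves p_m = (1 - p) ** total for p. The value is computed with expm1
    to avoid losing precision when the result is very small.
    """
    return -math.expm1(math.log(p_m) / total)


# With 2000 training tokens the zero probability is roughly 2.6e-5.
p0 = zero_probability(2000)
```

The value shrinks as the amount of training data grows, so unseen tokens count more heavily against a language that was trained on more data.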

Similar calculations using f(t) in place of 
f(t,l) and F in place of f(l) give the a priori 
token probabilities p(t). As already noted, base,
low and high values could have been calculated
in this case, but as a minor simplification, we 
use only the base probability. 



2 Thanks to one of the referees for pointing this out. 



3 Evaluation 

To evaluate the technique, a test was run using 
similar data to Sibun and Reynar. Corpora for 
eighteen languages from the European Corpus 
Initiative CDROM 1 were extracted and split 
into non-overlapping files, one containing 2000 
tokens^3, one containing 200 tokens, and 25 files
each of 1, 5, 10 and 20 tokens. The 2000 and 
200 token files were used as training data, and 
the remainder for test data. Wherever possible 
the texts were taken from newspaper corpora, 
and failing that from novels or literature. The 
identification algorithm was run on each test file 
and the results placed in one of four categories: 

• Definitive, correct decision made. 

• No decision made by the end of the input, 
but highest scoring language was correct. 

• No decision, highest scoring language in- 
correct. 

• Definitive, incorrect decision made. 

The sum of the first two figures divided by the 
total number of tests gives a measure of accu- 
racy; the sum of the first and last divided by the 
total gives a measure of decisiveness, expressed 
as the proportion of the time a definitive deci- 
sion was made. The tests were executed using 
word shape tokens on the same coding scheme 
as Sibun and Reynar, and using the words as 
they appeared in the corpus. No adjustments 
were made for punctuation, case, etc. Vari- 
ous activation thresholds were tried: raising the 
threshold increases accuracy by requiring more 
information before a decision is made, but re- 
duces decisiveness. With shapes and 2000 to- 
kens of training data, at a threshold of 14 or 
more, all the 20 token files gave 100% accu- 
racy. For words themselves, the threshold was 
set to 22. The results of these tests appear in 
table 1. The figures for the activation threshold
were determined by experimenting on the data. 
An interesting area for further work would be to 
put this aspect of the procedure on a sounder 
theoretical basis, perhaps by using the a priori 
probabilities of the individual languages. 
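The two measures defined above are simple ratios over the four result categories. The sketch below uses hypothetical counts purely to show the arithmetic; the function and parameter names are not from the original experiments.

```python
from typing import Tuple


def accuracy_and_decisiveness(
    decided_correct: int,
    undecided_correct: int,
    undecided_incorrect: int,
    decided_incorrect: int,
) -> Tuple[float, float]:
    """Return (accuracy, decisiveness) from the counts of the four categories."""
    total = decided_correct + undecided_correct + undecided_incorrect + decided_incorrect
    # Accuracy: the top-scoring language was correct, decided or not.
    accuracy = (decided_correct + undecided_correct) / total
    # Decisiveness: a definitive decision was made, right or wrong.
    decisiveness = (decided_correct + decided_incorrect) / total
    return accuracy, decisiveness


# Hypothetical counts for 100 test files.
accuracy, decisiveness = accuracy_and_decisiveness(80, 10, 5, 5)
```

With these counts the accuracy is 0.9 and the decisiveness is 0.85.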

3 Sibun and Spitz, and Sibun and Reynar, present 
their results in terms of lines of input, with 1-5 lines 
corresponding roughly to a sentence, and 10-20 lines to 
a paragraph. Estimating a line as 10 words, we are there- 
fore working with significantly smaller data sets. 



The accuracy figures are generally similar to 
or better than those of Sibun and Reynar. The 
corresponding figures for 200 tokens of training 
data appear in table 2, for the token identification
task only.

One of the strengths of the algorithm is that 
it makes a decision as soon as one can be made 
reliably. Table 3 shows the average number of
tokens which have to be read before a decision 
can be made, for the cases where the decision 
was correct and incorrect, and for both cases 
together. Again, the results are for word shape 
tokens, and for words alone. The figures show 
that convergence usually happens within about 
10 words, with a long tailing off to the results. 
The longest time to convergence was 153 shape 
tokens. 

A manual inspection of one run (2000 lines 
of training data, tokens, threshold=14) shows 
that errors are sometimes clustered, although
quite weakly. For example, Serbian, Croatian 
and Slovenian show several confusions between 
them, as in Sibun and Reynar's results. There 
are two observations to be made here. Firstly, 
there are about as many other errors between 
these languages and languages which are unrelated
to them, such as Italian, German and Norwegian,
and so the errors may be due to poor
quality data rather than a lack of discrimina- 
tion in the algorithm. For example, Croatian 
is incorrectly recognised as Serbian 3 times and 
as Slovenian once, while the languages which 
are misrecognised as Croatian are German and 
Norwegian (once each). Secondly, even where 
there are errors, the range of possibilities has 
been substantially reduced, so that a more pow- 
erful process (such as full-scale OCR followed 
by identification on words rather than shape to- 
kens, or a raising of the threshold and adding 
more data) could be brought in to finish the 
job off. That is, the confidence limits have pro- 
vided a benefit in reducing the search space. 
The confusion matrix for this case appears in 
an appendix. 

Table 1: Performance with 2000 tokens of training data

Table 2: Performance with 200 tokens of training data (word shape tokens only)

3.1 Broader applicability

Although the algorithm was developed with language
identification in mind, it is interesting to
explore other classification problems with it. A
simple and rather crude experiment in "genre"
identification was carried out, using the Brown
corpus. Each section of the corpus (labelled A,
B, C ... R in the original) was taken as a genre,
and files of similar distribution to the previous 
experiment were extracted. Because this is a 
more unconstrained problem, the training
and test sets were about 10 times the size of
those in the language identification task. A 20000 word
file was used as training data, and the remaining
files as test data. Accuracy and decisiveness
results appear in table 4. Beyond the
activation threshold of 12, there is no significant
improvement in accuracy. The technique seems 
to give good accuracy when there is sufficient 
input (100 words or more), but at the cost of 
very low decisiveness. Excluding a fixed list of 
common words such as function words might 
increase the decisiveness. These results should 
be taken with a pinch of salt, as the notion of 
genre is not very well-defined, and it is not clear 
that sections of the Brown corpus really repre- 
sent coherent categories, but they may provide 
a starting point for further investigation. 

3.2 On decisiveness 

Decisiveness represents the degree to which a
unique decision has been made with a high degree
of confidence. In cases where no unique decision
has been made, the range of possibilities
will often have been reduced: a category is only
still possible at any stage if its high accumula- 
tor value is greater than the low accumulator 
value of the best rated category. To illustrate 
this, the number of categories which are still 
possible when all the input was exhausted was 
examined. The results appear in tables 5 and
6, for the tests of language identification from
word shape tokens with an activation threshold 
of 14 and a training set of 2000 tokens, and for 
genre identification with a threshold of 12 and a 
training set of 20000 tokens. Results are shown 
for the cases of a correct decision, an incorrect 
one, and all cases. The average number of 
possibilities remaining is 1.3 out of 18 for the 
language identification test, and 9.7 out of 15 
for the genre test, showing that we are generally 
near to convergence in the former case, but have 
only achieved a small reduction in the possibil- 
ities in the latter, in keeping with the generally 
low decisiveness. 
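The count of remaining categories can be computed from the final accumulator values of each test run. The sketch below assumes each run is summarised as a mapping from category to its (base, low, high) accumulators; the names `remaining` and `average_remaining` and the example values are hypothetical.

```python
from typing import Dict, List, Tuple

Acc = Dict[str, Tuple[float, float, float]]  # category -> (base, low, high)


def remaining(acc: Acc) -> List[str]:
    """Categories still possible: the best one, plus any whose high
    accumulator is greater than the low accumulator of the best."""
    best = max(acc, key=lambda c: acc[c][0])
    best_low = acc[best][1]
    return [c for c in acc if c == best or acc[c][2] > best_low]


def average_remaining(runs: List[Acc]) -> float:
    """Average number of categories still possible over a set of test runs."""
    return sum(len(remaining(acc)) for acc in runs) / len(runs)


# Two hypothetical runs: one leaves two categories open, the other has converged.
runs = [
    {"a": (5.0, 3.0, 7.0), "b": (2.0, 1.0, 3.5), "c": (-4.0, -6.0, -2.0)},
    {"a": (9.0, 8.0, 10.0), "b": (1.0, 0.0, 2.0), "c": (0.0, -1.0, 1.0)},
]
average = average_remaining(runs)
```

A value close to 1 indicates that most runs were at or near convergence when the input ran out.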

Threshold   Shape tokens                   Words
            Correct  Incorrect  All        Correct  Incorrect  All
            3.22     1.23       2.65       1.81     1.07       1.66
10          7.33     4.55       7.28       5.31     3.88       5.28
14          9.35     6.50       9.33       -        -          -
22          -        -          -          10.6     8.00       10.6

Table 3: Average number of tokens read before convergence

Table 4: Performance on genre identification

Table 5: Categories remaining at end of input (language identification from word shape tokens)

Table 6: Categories remaining at end of input (genre identification)

3.3 A further comparison

The classification algorithm described above
was originally developed in response to Sibun
and Spitz's work. There is another approach
to language identification, which has a certain
amount in common with ours, described in a
patent by Martino and Paulsen (1996). Their
approach is to build tables of the most frequent 
words in each language, and assign them a normalised
score, based on the frequency of occurrence
of the word in one language compared to
the total across all the languages. Only the
most frequent words for each language are used. 
The algorithm works by accumulating scores, 
until a preset number of words has been read 
or a minimum score has been reached. They 
also apply the technique to genre identification. 
Since there is a clear similarity, it is perhaps
worth highlighting the differences. In terms of
the algorithm, the most important difference is 
that no confidence measures are included. The 
complexities of splitting the data into different 
frequency bands for calculating probabilities are 
thus avoided, but no test analogous to overlap- 
ping confidence intervals can be applied. Mar- 
tino and Paulsen say they obtain a high degree 
of confidence in the decision after about 100 
words, without saying what the actual success 
rate is; we can compare this with around 10 
words (or tokens) for convergence here. 

4 Conclusions 

We have examined a simple technique for clas- 
sifying a stream of input tokens in which con- 
fidence measures are used to determine when a 
correct decision can be made. The results in 
table 1 show that there is a tradeoff between
accuracy and the degree to which the algorithm 
selects a single language. Not surprisingly, the 
amount of training data also affects the per- 
formance, with 2000 tokens being adequate for 
accuracy close to 100%, and convergence typi- 
cally being reached in the first 10 tokens. On 
a more unconstrained problem, such as genre 
identification from words alone, the algorithm 
performs less well in both accuracy and deci- 
siveness even with significantly more training 
data, and is probably not adequate except as a 
preprocessor to some more knowledge intensive 
technique. 

In a sense, language identification is not a 
very interesting problem. As we have noted, 
there are plenty of techniques which work well, 
each with its own characteristics and suitability 
for different application areas. What is perhaps 
more important is the way the statistical infor- 
mation has been used here. When we take a 
statistical or data-led approach to NLP, there 
are two things which can help us trust that 
the technique is accurate. The first is a be- 
lief that the statistical technique is an adequate 
model of the underlying process which "gener- 
ates" the data, using theoretical considerations 
or some external source of knowledge to inform 
this belief. The second is quantitative evalua- 
tion on test data which has been characterised 
by an outside source (for example, in the case of 
part of speech tagging, a corpus which has been 
manually annotated, or at least automatically
tagged and manually corrected). The problem
with quantitative evaluation is that we do not 
know whether it will generalise, so that if we 
train on one data set, we have only the theoretical
model to reassure us that the same model
will work on a different data set. The idea I 
have been presenting here is to get the statisti- 
cal process itself to provide feedback about it- 
self, through the use of confidence limits which 
are themselves based in the statistical model. In 
doing so, we hope to avoid presenting a result 
for which we lack adequate evidence. 
Appendix

Confusion matrix for the case of 2000 lines of 
training data, token, threshold=14. An entry 
in this matrix means that the language on the 
horizontal axis was classified as being in the lan- 
guage on the vertical axis in the indicated num- 
ber of test samples. 

(alb = Albanian, cro = Croatian, dan = Dan- 
ish, dut = Dutch, eng = English, est = Esto- 
nian, fre = French, ger = German, ita = Italian, 
lat = Latin, lit = Lithuanian, mal = Malay, nor 
= Norwegian, por = Portuguese, ser = Serbian,
slo = Slovenian, spa = Spanish, tur = Turk- 
ish. Some of the languages are in a Romanised 
form.) 